MSR-MT: The Microsoft Research Machine Translation System
Abstract
MSR-MT is a data-driven MT system that combines rule-based and statistical techniques with example-based transfer. This hybrid, large-scale system is capable of learning all its knowledge of lexical and phrasal translations directly from data. MSR-MT has undergone rigorous evaluation showing that, trained on a corpus of technical data similar to the test corpus, its output surpasses the quality of best-of-breed commercial MT systems.

System builder: Natural Language Processing group, Microsoft Research
Contact information: Bill Dolan, Microsoft Research, One Microsoft Way, Redmond WA 98052; email: [email protected]
System Category: Advanced research prototype
System Characteristics: (see description below)
Hardware platform and operating system: PC, Windows XP
System operations specialist: Arul Menezes; email: [email protected]

1 System Description

MSR-MT is a data-driven MT system that combines rule-based and statistical techniques with example-based transfer [1]. We believe this hybrid system to be the first practical large-scale MT system capable of learning all its knowledge of lexical and phrasal translations directly from data. MSR-MT has undergone rigorous evaluation showing that, trained on a corpus of technical data similar to the test corpus, its output surpasses the quality of best-of-breed commercial MT systems. In addition, a large pilot study has shown that users judge the system’s output accurate enough to be useful in helping them perform highly technical tasks.

Figure 1 below provides a simplified schematic of the architecture of MSR-MT. The central feature of the system’s training mode is an automatic logical form (LF) alignment procedure that creates the system’s translation example base from sentence-aligned bilingual corpora [2]. During training, statistical word association techniques [3] supply translation pair candidates for alignment and identify certain multiword terms. This information is used in conjunction with information about the sentences’ LFs, provided by robust, broad-coverage syntactic parsers, to identify phrasal transfer patterns.

Figure 1. MSR-MT Architecture (from Richardson et al., 2001)

At run time, these same syntactic parsers are used to produce an LF for the input string. The goal of the transfer component is thus to identify translations for pieces of this input LF and to stitch the matched pieces into a target-language LF that can serve as input to generation. The example-based transfer component is augmented by decision trees that make probabilistic decisions about the relative plausibility of competing transfer mappings in a given target context [4].

The broad-coverage parsers used by MSR-MT were created originally for monolingual applications and have been used in Microsoft Word’s grammar checker [5]. Parsers now exist for seven languages (English, French, German, Spanish, Chinese, Japanese, and Korean), and active development continues to improve their accuracy and coverage. These analysis components rely on hand-crafted monolingual lexicons, but the only bilingual lexical resource available to the system is a small lexicon of function-word translations; all other lexical translation information is learned automatically from data. Generation components are currently being developed for English, Spanish, Japanese, German, and French. The English, Spanish, and Japanese generation components have been built by hand, while for German the mapping from logical form to sentence string has been approached as a machine learning problem [6].
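The run-time flow described above (parse the input into an LF, match pieces of that LF against the learned transfer mappings, stitch the matched target pieces together, and hand the result to generation) can be pictured with a short sketch. The Python below is purely illustrative: the class and function names (LFNode, parse_to_lf, transfer, generate) and the hard-coded English-to-Spanish mappings are our own assumptions, not MSR-MT’s actual data structures, transfer database, or API.

    # Hypothetical sketch of MSR-MT's run-time flow: parse -> LF transfer -> generation.
    # All names and mappings here are illustrative, not MSR-MT's actual API or data.
    from dataclasses import dataclass, field

    @dataclass
    class LFNode:
        """A logical-form node: a lemma plus labeled dependents."""
        lemma: str
        deps: dict = field(default_factory=dict)   # relation label -> LFNode

    # Toy stand-in for the learned transfer example base (English -> Spanish).
    # In MSR-MT these mappings come from aligning the LFs of a bilingual corpus.
    TRANSFER_EXAMPLES = {
        "click":  "hacer_clic",
        "button": "botón",
        "user":   "usuario",
    }

    def parse_to_lf(sentence: str) -> LFNode:
        """Stand-in for the broad-coverage parser: builds a flat toy LF."""
        words = sentence.lower().strip(".").split()
        root = LFNode(words[0])
        for i, w in enumerate(words[1:], start=1):
            root.deps[f"arg{i}"] = LFNode(w)
        return root

    def transfer(node: LFNode) -> LFNode:
        """Map each piece of the source LF to a target piece and stitch them together."""
        target = LFNode(TRANSFER_EXAMPLES.get(node.lemma, node.lemma))
        for rel, child in node.deps.items():
            target.deps[rel] = transfer(child)      # recurse over the LF graph
        return target

    def generate(node: LFNode) -> str:
        """Stand-in for the generation component: linearizes the target LF."""
        return " ".join([node.lemma] + [generate(c) for c in node.deps.values()])

    if __name__ == "__main__":
        lf = parse_to_lf("User click button.")
        print(generate(transfer(lf)))               # -> usuario hacer_clic botón

In the real system the transfer database is learned by aligning the LFs of a sentence-aligned bilingual corpus, and competing mappings are ranked by decision trees; the sketch collapses both into a single dictionary lookup.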
A machine-learned generation component for French is also under development. We have thus far created systems that translate into English from French, German, Spanish, Japanese, and Chinese, and that translate from English into Spanish, German, and Japanese. MSR-MT’s modular architecture means that, in principle, it is possible to rapidly create an MT system for any language pair for which the necessary parsing and generation components exist, along with a suitable corpus of bilingual data. A successful initial experiment in rapidly deploying a new system (French-Spanish) is described in [7]. We have also experimented preliminarily with Chinese to Japanese.

While performance has not been a focus of our research, both training and translation times are fast enough to allow for interactive error analysis and development. Training a transfer database from a 300K-sentence bilingual corpus (average sentence length 18.8 words) takes about 1.5 hours on a cluster of 40 processors averaging 500 MHz each. Run-time translation on a dual-processor 1 GHz PC averages about 0.31 seconds per sentence, or about 59 words per second.

When trained on thousands of bilingual sentence pairs taken from the Microsoft technical domain, MSR-MT has been shown to yield translations superior to those produced by best-of-breed commercial MT systems. These claims are based on rigorous evaluations carried out by multiple (typically 6-7) human raters, each examining hundreds of test sentences. The test sentences are drawn from the same pool as the training sentences and are thus similar in length and complexity. Test and training data are kept strictly separate, and the test data is blind to system developers. Human raters have no knowledge of the internal workings of either MSR-MT or the other MT systems, are employed by an independent vendor organization, and are given no indication of which system produced the translations they are rating. We believe this evaluation methodology to be one of the most rigorous described in the MT literature.

In addition to these ongoing quality evaluations, a pilot study involving 60,000 Spanish-speaking users of Microsoft’s Spanish-language technical support web site found that they were overwhelmingly satisfied with the quality of MSR-MT’s translations of English technical documentation. A random sample of approximately 400 users found that 84% were satisfied with the translation quality of the technical articles they accessed.
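The blind evaluation protocol described above (several independent raters employed by an independent vendor, with no indication of which system produced which translation) can likewise be illustrated with a short hypothetical sketch. The rating scale, the aggregation by simple averaging, and every name in the snippet (blind_items, collect_scores, the toy systems and raters) are assumptions made only for illustration; the description here does not commit to these details.

    # Hypothetical sketch of a blinded comparative MT evaluation in the spirit of the
    # protocol described above. Rating scale, aggregation, and all data are assumed.
    import random
    from statistics import mean

    def blind_items(test_sentences, system_a, system_b):
        """Pair the two systems' outputs per source sentence, hiding system identity."""
        items = []
        for src in test_sentences:
            outputs = [("A", system_a(src)), ("B", system_b(src))]
            random.shuffle(outputs)          # raters never learn which system is which
            items.append((src, outputs))
        return items

    def collect_scores(items, raters):
        """Every rater scores every translation; scores are averaged per (hidden) system."""
        per_system = {"A": [], "B": []}
        for src, outputs in items:
            for label, translation in outputs:
                for rate in raters:          # e.g. the 6-7 independent raters mentioned above
                    per_system[label].append(rate(src, translation))
        return {label: mean(scores) for label, scores in per_system.items()}

    if __name__ == "__main__":
        # Toy stand-ins for the two MT systems and for the human raters.
        system_a = lambda s: s + " (output of system A)"
        system_b = lambda s: s + " (output of system B)"
        raters = [lambda src, hyp, bias=b: random.gauss(2.5 + bias, 0.5)  # noisy scores on an
                  for b in (0.0, 0.1, -0.1)]                              # assumed 1-4 style scale
        items = blind_items(["Click the button.", "Restart the computer."], system_a, system_b)
        print(collect_scores(items, raters))

A real evaluation would of course use trained human raters and per-sentence judgments rather than toy random scores; the sketch is meant only to show the blinding and aggregation structure.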
Publication date: 2002